A Phrase Table without Phrases: Rank Encoding for Better Phrase Table Compression
Abstract
This paper describes the first steps towards a minimum-size phrase table implementation for phrase-based statistical machine translation. The focus lies on reducing the size of target language data in a phrase table. Rank Encoding (R-Enc), a novel method for compressing word-aligned target language data in phrase tables, is presented. Combined with Huffman coding, R-Enc achieves a relative size reduction of 56 percent for target phrase words and alignment data compared to bare Huffman coding without R-Enc. In the context of the complete phrase table, the size reduction is 22 percent.
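The abstract does not spell out how rank encoding works, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes each target word can be replaced by its rank among the translation candidates of its aligned source word, and it uses a hypothetical toy lexicon (`LEXICON`) and toy word pairs. It illustrates why such a step could help Huffman coding: the rank stream is far more skewed than the word stream, so it needs fewer bits. The cost of storing the lexicon itself is ignored here.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encoded_bits(symbols):
    """Total number of bits needed to Huffman-encode the sequence."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


# Hypothetical lexicon (an assumption for illustration only):
# source word -> target candidates, most probable first.
LEXICON = {
    "haus": ["house", "home", "building"],
    "katze": ["cat", "kitty"],
    "hund": ["dog", "hound"],
    "baum": ["tree", "beam"],
}


def rank_encode(aligned_pairs):
    """Replace each target word by its rank for the aligned source word."""
    return [LEXICON[src].index(tgt) for src, tgt in aligned_pairs]


def rank_decode(source_words, ranks):
    """Recover target words from ranks, given the aligned source words."""
    return [LEXICON[src][rank] for src, rank in zip(source_words, ranks)]


# Toy word-aligned data (hypothetical).
pairs = [
    ("haus", "house"), ("katze", "cat"), ("hund", "dog"), ("baum", "tree"),
    ("haus", "home"), ("katze", "cat"), ("hund", "hound"), ("baum", "tree"),
]
target_words = [tgt for _, tgt in pairs]
source_words = [src for src, _ in pairs]
ranks = rank_encode(pairs)

print("bits for plain target words:", encoded_bits(target_words))
print("bits for rank-encoded words:", encoded_bits(ranks))
```

Decoding in this sketch needs the aligned source words, which is why the approach depends on word-aligned target data: `rank_decode(source_words, ranks)` returns the original target words.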
Similar resources
Phrasal Rank-Encoding: Exploiting Phrase Redundancy and Translational Relations for Phrase Table Compression
We describe Phrasal Rank-Encoding (PR-Enc), a novel method for the compression of word-aligned target language data in phrase tables as used in phrase-based SMT. This method reduces the redundancy in phrase tables, which is a direct effect of the phrase-based approach. Combining PR-Enc with Huffman coding reduces the size of an aggressively compressed phrase table by another 39 per...
Exploiting Directional Asymmetry in Phrase-table Generation for Statistical Machine Translation
This paper presents a method that can improve the translation quality of a phrase-based statistical machine translation system without the need for additional training data. The technique exploits the asymmetry of the phrase-table generation process during training. In our experiments we use the GIZA++ toolkit for alignment, and the phrase extraction utilities that are provided with the MOSES d...
Joint Phrase Alignment and Extraction for Statistical Machine Translation
The phrase table, a scored list of bilingual phrases, lies at the center of phrase-based machine translation systems. We present a method to directly learn this phrase table from a parallel corpus of sentences that are not aligned at the word level. The key contribution of this work is that while previous methods have generally only modeled phrases at one level of granularity, in the proposed m...
A Relationship: Word Alignment, Phrase Table, and Translation Quality
In recent years, researchers have conducted several studies to evaluate machine translation quality based on the relationship between word alignments and the phrase table. However, existing methods usually employ ad-hoc heuristics without theoretical support. So far, there has been no discussion aimed at providing a formula to describe the relationship among word alignments, phrase table, and ...
Phrase Table Pruning via Submodular Function Maximization
Phrase table pruning is the act of removing phrase pairs from a phrase table to make it smaller, ideally removing the least useful phrases first. We propose a phrase table pruning method that formulates the task as a submodular function maximization problem, and solves it by using a greedy heuristic algorithm. The proposed method can scale with input size and long phrases, and experiments show ...